
My Capstone Project:

Surrounding Business and Competitive Analysis Research for a New Restaurant Location

Introduction: Business Problem

When new or existing restaurant owners are choosing a location for a restaurant, there are multiple key factors to consider and understand before committing to a site.

Based on my initial research of restaurant location services and restaurant owner magazines, I identified several key factors necessary for choosing a restaurant business location.

The key factors for site selection include visibility, parking, space size, crime rates, surrounding businesses and competitor analysis, accessibility, and safety. These criteria likely apply to other small to medium-sized businesses as well.

For the purposes of my Capstone project, I have narrowed my focus to “Understanding Surrounding Businesses and Competitor Analysis for New Restaurant Locations” as the key problem for my project to address.

Background Discussion

I begin with Lisa Melbourne, a fictitious restaurant owner of “Breaking Bread”, a highly successful breakfast-only restaurant operating in Southern California. Lisa is looking to expand her restaurant operations into Northern California, with particular interest in the San Francisco South Bay region. She has enlisted my services and needs research and recommendations to identify a city and location with high or increasing cross-business traffic and the fewest breakfast venues within a couple of miles, so she is not launching her restaurant into a saturated, competitive market.

The following data science phases will be performed for this project:

  1. Business Understanding, problem description, stakeholder research
  2. Data collection and gathering from various endpoints
  3. Preprocessing of data
  4. Algorithms, Tools, Machine Learning
  5. Visualizations
  6. Results and conclusions

Data: Description and Use in Solving the Problem

My goal is to provide Lisa with the research data she requires so she can make an informed decision regarding a new restaurant location. The collection process must include all of the following:

  1. The first dataset is neighborhood data built from postal codes for San Jose and the immediately surrounding areas. San Jose zipcode data was scraped from the following sources: https://tools.usps.com/find-location.htm and https://www.postallocations.com/ca/county/santa-clara. The zipcodes will be useful for defining neighborhood references and for geo-mapping.

  2. The project will combine data from a mixture of CSV files, GET requests returning JSON, venue API JSON data, and a geolocation API.

  3. https://www.sanjose.org/restaurants?field_city_value=san jose#restaurants-listing

    • List of best restaurants in San Jose and surrounding areas
  4. Foursquare data through API will be used to meet several requirements:

    • San Jose regional data showing all business venues required and segmented as follows:
      • Identify all restaurants in San Jose and the surrounding area that serve breakfast, with the ability to assess venue ratings and see how well businesses are performing
      • Businesses surrounding all restaurants within the targeted neighborhoods, with the goal of understanding customer traffic to assess how well businesses are doing within the San Jose region.
      • Isolating a specific search to see how breakfast restaurants are performing
      • Data by zip/suburb to identify the specific sub-regions within San Jose our client is interested in.
      • Business customer traffic data so the client can see which venues and locations their customers are coming from.
  5. A population density dataset for our region of restaurant-location interest is retrieved from a California population dataset arranged by zipcode. Data is available from https://www.california-demographics.com/zip_codes_by_population.

    • The goal of the population data is inference, so we can deduce properties of the underlying distribution. Sales, revenue, and venue traffic are all driven by supply and demand. By referencing California population data by zipcode, the aim is to understand the ratio of restaurants to densely and sparsely populated areas. Population density can help the client target not only a restaurant location but also unmet area needs in food selection.

    • Bottom line: it helps us understand potential venue traffic growth in a small area, as well as geographical opportunities for property placement.
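As a sketch of how the Foursquare venue data above might be gathered: the helpers below compose a classic Foursquare v2 venues/explore request URL and flatten the groups → items structure of its JSON response. The credentials are placeholders, and the sample payload is illustrative rather than an actual API response.

```python
from urllib.parse import urlencode

# Placeholder credentials -- a real run needs actual Foursquare keys.
CLIENT_ID, CLIENT_SECRET, VERSION = "YOUR_ID", "YOUR_SECRET", "20200101"

def build_explore_url(lat, lon, radius=2000, query="breakfast", limit=100):
    """Compose a Foursquare v2 venues/explore request URL."""
    params = urlencode({
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": f"{lat},{lon}",
        "radius": radius,
        "query": query,
        "limit": limit,
    })
    return f"https://api.foursquare.com/v2/venues/explore?{params}"

def parse_explore_response(payload):
    """Flatten the groups -> items structure of an explore response
    into (name, category, lat, lng) tuples."""
    rows = []
    for group in payload["response"]["groups"]:
        for item in group["items"]:
            venue = item["venue"]
            cats = venue.get("categories", [])
            rows.append((
                venue["name"],
                cats[0]["name"] if cats else None,
                venue["location"]["lat"],
                venue["location"]["lng"],
            ))
    return rows

# Tiny sample payload mirroring the explore response shape (illustrative).
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Sunrise Cafe",
               "categories": [{"name": "Breakfast Spot"}],
               "location": {"lat": 37.33, "lng": -121.89}}}]}]}}

print(build_explore_url(37.3382, -121.8863))
print(parse_explore_response(sample))
```

In a live notebook, the URL would be passed to requests.get and the parsed rows loaded into a pandas dataframe.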

Methodology - Exploratory Analysis, Statistical Examination, Machine Learning

  1. Create, wrangle, and clean the initial datasets.
  2. Develop a neighborhood dataset
    • Get the neighborhood dataframe constructed, cleaned, and working
  3. Get folium mapping configured and working
  4. All San Jose zipcodes and surrounding areas of interest will be converted to an sj_zipcode dataframe
  5. Zipcode data will be used to segment neighborhoods
  6. All neighborhoods/zipcodes will be assigned latitudes and longitudes using geocode information from http://cocl.us/Geospatial_data, with all neighborhood data merged with the geocoded data
  7. For competitive analysis:

    • At the San Jose level, provide a method to show unique groupings of breakfast venues.
    • A method to describe the unique number and percentage of restaurants operating within San Jose, organized to show competition and saturation levels.
    • A method for location identification:
      • To further refine their area of interest, multiple geolocation maps are provided so the client can select and best refine their venue search within a desired radius.
      • Popup data shows the number of breakfast restaurants in each selected area.
      • The saturation level of other restaurants serving breakfast, and their proximity to the client's prospective site, is overlaid on the map plot of the area of interest for the client's review.
      • Population density to understand population-to-restaurant ratios for neighborhoods.
  8. Foursquare endpoints we will use include: v2/venues/search, v2/venues/explore, v2/venues/nextvenues

  9. Functions will be developed to reduce replication.
  10. A new function and section will be created to build a dataframe showing the source venues from which customer traffic arrives at our restaurant venue (next_venue)
  11. A most-common-venues dataframe will be built for San Jose
  12. A new dataframe will be created to show the cluster assignment as well as the top 10 venues for each neighborhood.
  13. Top breakfast-serving location dataset and methods
  14. Establishing name, address, city, state, and zip, combined into a FullAddress field for the Foursquare search
  15. Geocoding setup so we can extract latitude and longitude coordinates anywhere we have a full physical address, then updating the geopoints in the respective dataframes
  16. Computing the mean latitude and longitude of all our location datasets so we can center the folium map on our data.
  17. Setting up multiple marker styles and child maps combined as overlays in a single view.
  18. Customizing folium markers for improved visualization, adding venue name and location data to the popup shown on mouse selection.
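The geocoded-coordinate steps above (mean-centering the map, filtering venues to a desired radius) can be sketched with a couple of small helpers. The function names (haversine_m, map_center, venues_within) and the sample coordinates are my own illustration, not code from the project:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def map_center(points):
    """Mean latitude/longitude of a dataset -- handy for centering a folium map."""
    lats, lons = zip(*points)
    return sum(lats) / len(lats), sum(lons) / len(lons)

def venues_within(venues, center, radius_m=2000):
    """Keep (name, lat, lon) venues falling inside the search radius."""
    return [v for v in venues
            if haversine_m(center[0], center[1], v[1], v[2]) <= radius_m]

# Two illustrative neighborhood centroids in the San Jose area.
neighborhoods = [(37.3382, -121.8863), (37.3541, -121.9552)]
print(map_center(neighborhoods))
```

The tuple returned by map_center is what would be passed as the `location` argument when constructing the folium map.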

EXHIBITS

1. Let's jump in. The first resulting dataset (first 5 rows of data):


2. Example of a Folium choropleth map showing neighborhood data, shaded by geographic region, from our venue-tracking research


3. Example of Folium circle markers showing neighborhood data, with popup data on mouse selection.


4. Top breakfast locations example


5. Folium map marker example showing an overlay of two child maps, with mouse-over popups on the marker icons for additional information. Selecting the larger circle marker displays the neighborhood location.


6. Same view as above, except here the blue markers provide business and geographic information for restaurant venues on mouse selection. HTML frames or IFrames can be configured with images to capture deeper detail.


7. Example of our Foursquare API venue data, queried in explore mode within a radius perimeter of the geopoint used for the search.


8. This image is a Matplotlib scatter plot showing the distribution of top breakfast spots across different cities. With our project focused on San Jose and the data sourced from SanJose.org, top-rated breakfast locations are mostly concentrated within San Jose, aside from a few reputable cuisine outliers in Woodside, Santa Cruz, and Santa Clara. Note that the plot covers only 24 of the 60+ top breakfast venues sourced. I am considering swarm plot and box plot images for the next update.

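A minimal, hypothetical reproduction of a scatter plot like Exhibit 8. The per-city counts here are made up for illustration and do not reflect the actual dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical per-city counts of top-rated breakfast venues.
cities = ["San Jose", "Santa Clara", "Santa Cruz", "Woodside"]
counts = [18, 3, 2, 1]

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(cities, counts)
ax.set_xlabel("City")
ax.set_ylabel("Top-rated breakfast venues")
ax.set_title("Top breakfast spots by city (illustrative data)")
fig.savefig("breakfast_by_city.png", bbox_inches="tight")
```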

9. Here we see restaurant venues, listed alphabetically and grouped by the number of restaurants of each type. 1170 restaurants were captured using Foursquare. Time-of-day operation or other means will have to be used to identify breakfast-only venues, which will be done in an upcoming update release I have committed to.


11. Let's print each neighborhood along with the top 5 most common venues

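The per-neighborhood "top N most common venues" report can be sketched in plain Python. The (neighborhood, venue_category) pairs below are toy data standing in for the Foursquare results:

```python
from collections import Counter, defaultdict

# Toy (neighborhood, venue_category) pairs standing in for the Foursquare data.
visits = [
    ("Willow Glen", "Breakfast Spot"), ("Willow Glen", "Coffee Shop"),
    ("Willow Glen", "Breakfast Spot"), ("Japantown", "Sushi Restaurant"),
    ("Japantown", "Coffee Shop"), ("Japantown", "Sushi Restaurant"),
]

def top_venues_by_neighborhood(pairs, n=5):
    """Return {neighborhood: [up to n most common venue categories]}."""
    buckets = defaultdict(Counter)
    for hood, category in pairs:
        buckets[hood][category] += 1
    return {hood: [cat for cat, _ in counts.most_common(n)]
            for hood, counts in buckets.items()}

for hood, top in top_venues_by_neighborhood(visits).items():
    print(hood, "->", top)
```

In the notebook the same result comes from a pandas one-hot encoding grouped by neighborhood; this Counter version shows the logic in miniature.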

  • K-means is a highly popular machine learning tool for its ability to find clusters within data, grouping unlabeled data sets into multiple clusters without supervision. It uses different algorithmic methods to perform partition clustering, dividing the data into non-overlapping subsets (clusters).
  • One popular application of k-means is identifying the purchasing activity of customer groups, making it ideal for the venue traffic patterns I share below, where the cluster labels represent 5 cluster groups and each cluster is ranked from the 1st most common venue down to the 10th.
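A minimal sketch of the k-means step, assuming a one-hot venue-frequency matrix like the one built in the methodology above. The toy matrix and cluster count are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy venue-frequency matrix: rows are neighborhoods, columns are venue
# categories (say: Breakfast Spot, Coffee Shop, Park).
freq = np.array([
    [0.8, 0.2, 0.0],
    [0.7, 0.3, 0.0],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
    [0.5, 0.5, 0.0],
])

# Partition the neighborhoods into k non-overlapping clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)
print(km.labels_)  # one cluster label per neighborhood
```

The resulting labels column is what gets merged back into the neighborhood dataframe for mapping and reporting.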


13. Here we see our top 10 highest-frequency examples, with Central Park being the most active and common venue, followed by Mexican food.


14. Population Density data added to Combined Cluster and Neighborhood Report


15. Final Analysis - Observations and Recommendations

  1. Everything was completed as required, and in some cases the original plan was exceeded.

  2. One drawback was getting reliable postal information.

  • Postallocations.com was used to derive an initial postal CSV file.
  • The US Postal Service site was also used to cross-check the information. It is important to note that zipcodes are commonly used for associating neighborhoods; however, zipcode locations are typically identified by the post office address, so we rely on postal routes to define regions. That is acceptable, but when querying a neighborhood region for venues within 2,000 meters of the postal location, there is no guarantee the post office sits at a sparsely populated edge of town rather than the center of the busiest district. For that reason, my next update will include full population detail by zipcode; in my opinion there is no better metric for measuring the hunger that drives restaurant demand. For now, we use zipcodes.
  • One more note on postal codes: P.O. boxes need to be removed from the set. These can be remote offices that offer nothing but confusion when used for geolocation on a zip.
  3. Future: add more venues, plus scatter and swarm plots covering more venues by city
  4. The following tools were used:
    • Python - the glue behind processing everything
    • Numpy - for handling data in a vectorized manner
    • Pandas - data loads, stores, modifications, dataframe construction & manipulation, remote GET requests, flattening json data
    • Json - library for handling JSON files
    • Geopy.Geocoders and Nominatim - For converting physical address into geopoint latitude and longitude coordinates
    • Matplotlib - and associated plotting modules
    • sklearn.cluster - for performing K-Means clustering
    • Folium - map rendering library

16. Result Findings

  1. All requirements and project deliverables were met. Additional data stores were found and leveraged for the project.
  2. Neighborhood targeting was completed; 70 neighborhoods were identified for exploration and reporting.
  3. Geolocation data was built into the datasets and used to identify venue locations.
  4. 26 top breakfast locations, rated by sanjose.org, were pulled into the dataset for exploration, development, and analysis.
  5. Geographical maps were developed and plotted, including layered map overlays for comparing neighborhood regions with the locations of top-rated breakfast restaurants. Highly useful for competitive analysis.
  6. 1170 restaurant venues were identified in total across San Jose and the surrounding areas.
  7. For all target neighborhood locations, the top 5 venues were reported and made available for review.
  8. Using machine learning, specifically K-means clustering, the top 10 most common venues were identified and reported by neighborhood and venue type.
  9. Visual maps were developed for review.
  10. Population density data was merged with the cluster data. This was a late addition whose capabilities I will continue to develop for understanding potential venue traffic growth in a small area, as well as geographical opportunities for property placement.
  11. K-means clustering was very effective for the clustering and venue-activity ranking process.
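The population-density idea can be illustrated with a small sketch. The zip_stats figures and the restaurants_per_10k helper are hypothetical, standing in for the scraped california-demographics.com population data and the Foursquare venue counts:

```python
# Hypothetical zipcode -> (population, restaurant_count) figures, standing in
# for the california-demographics.com population data and Foursquare counts.
zip_stats = {
    "95112": (55000, 120),
    "95125": (38000, 45),
    "95014": (60000, 30),
}

def restaurants_per_10k(stats):
    """Restaurants per 10,000 residents -- a rough saturation metric;
    a low value may signal unmet demand for a new venue."""
    return {z: round(count / pop * 10000, 2)
            for z, (pop, count) in stats.items()}

# List zipcodes from least to most saturated.
for zipcode, ratio in sorted(restaurants_per_10k(zip_stats).items(),
                             key=lambda kv: kv[1]):
    print(zipcode, ratio)
```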

Conclusion

Highlights:

Regarding the IBM Data Science Professional Program: first off, thank you to the IBM teaching staff, technical team, assistants, and learning facilitators between IBM and Coursera. This has been a professional, fun, and challenging experience. I also appreciated my time with my fellow students. This has been a good journey.

A few parting thoughts upon IBM Data Science Professional course conclusion:

  1. I personally recommend the IBM Data Science Professional program for project and program managers, architects, engineers, and professionals in finance, biotech, and medicine, or anyone with a business need or hobby who wants to tell stories with data and visualizations.

  2. Course modules are structured and enriched with coursework and information you will save and use for future reference. I am confident in the top experts developing the curriculum and conducting the training with Coursera, facilitating and delivering the highest-quality learning experiences. The Data Science Professional courses I took through Coursera were all IBM-sponsored, with solid, well-experienced instructors and PhDs who provide full and exciting assignment challenges.

  3. I highly encourage others to take a course or two online to get a taste.

  4. With Python and applied data science and machine learning, we can identify data solutions for a vast array of client needs, and in many cases be more effective than conventional means, e.g., moving away from spreadsheets and onto Pandas. Finance professionals are recognizing this learning opportunity at a growing pace.

  5. We can explore and identify many customer use and problem cases with stronger precision and accuracy than conventional business methods. The toolbox is large. That said, this is not a sweeping claim; numerous existing tools still have their place.

  6. This particular project required research across a variety of data stores, with data retrieved in a variety of formats and quickly brought into working form. It was a pleasant leap into advanced techniques and thinking outside the box in crunch time.

Thank you.

I can be reached at ericluiggi@gmail.com. Feel free to reach out, provide your review, or drop a comment. Thanks again!
